ABSTRACT


Many modern anatomy curricula teach histology using vir- 
tual microscopes, where students inspect tissue slices in a 
computer program (e.g. a web browser). However, the edu- 
cational data mining (EDM) potential of these virtual micro- 
scopes remains under-utilized. In this paper, we use EDM 
techniques to investigate three research questions on a vir- 
tual microscope dataset of N = 1,460 students. First, which 
factors predict the success of students locating structures in 
a virtual microscope? We answer this question with a gener- 
alized item response theory model (with 77% test accuracy 
and 0.82 test AUC in 10-fold cross-validation) and find that 
task difficulty is the most predictive parameter, whereas stu- 
dent ability is less predictive, prior success on the same task 
and exposure to an explanatory slide are moderately pre- 
dictive, and task duration as well as prior mistakes are not 
predictive. Second, what are typical locations of student 
mistakes? And third, what are possible misconceptions ex- 
plaining these locations? A clustering analysis revealed that 
student mistakes for a difficult task are mostly located in 
plausible positions ("near misses") whereas mistakes in an 
easy task are more indicative of deeper misconceptions. 
1. INTRODUCTION


Histology is a core subject that all medical students have to 
pass in their studies. An important part of classic histology 
training is the microscopy course where students examine a 
large number of slides of human or animal tissue with an op- 
tical microscope in order to identify cellular structures with 
the aim of establishing structure-function relationships [21]. 
In recent years, more and more virtual microscopes (VMs) 
have been developed and integrated into teaching [21]. Such 
VMs reduce the need for resources (students only require a 
computer and software), offer the opportunity to annotate 
slides with teacher notes, and enhance the student learn- 
ing experience [21]. Prior work has provided numerous case 
studies of VMs being successfully integrated into anatomy 
education around the globe, e.g. [5, 6, 10, 13, 21, 22]. More- 
over, several evaluation studies have shown that students 
using VMs perform at least as well as students using optical 
microscopes [11, 15]. 


To the best of our knowledge, no study to date has con- 
sidered the educational data mining potential of VMs. For 
example, VMs enable us to record which slides students have 
seen, which areas on the slides they have focused on, etc. In 
this work, we consider the MyMi.mobile VM that is used 
in anatomy courses at two German universities [10]. In this 
VM, students can view a slide with expert annotations (ex- 
ploration), and they can test their knowledge by either locat- 
ing a structure in a slide (structure search; refer to Figure 1), 
or identifying the tissue sample and staining (diagnosis). 
We analyze the performance of N = 1,460 students in structure
search tasks with respect to three research questions:


RQ1: Which features predict student success?
RQ2: What are typical locations of student mistakes?
RQ3: What are possible misconceptions explaining these locations?


To answer RQ1, we analyzed the collected learning data with 
a generalized item response theory [2, 12] model, which con- 
sists of a difficulty parameter for each task, an ability pa- 
rameter for each student, and four weights for additional 
features (see section 3.2). To answer RQ2 and RQ3, we 
employed a Gaussian mixture model [7] on the locations of 
mistakes and interpreted the resulting clusters with the help 
of domain experts. Our results can contribute to enhanced 
teaching quality in VM courses as well as establish inter- 
pretable models to analyze data from such courses. 


In the remainder of this paper, we cover related work, our 
experimental setup, the results, and a conclusion. 


2. RELATED WORK 


Prior work on machine learning on virtual microscope data 
has focused on applications outside education. For example, 
considerable prior work has trained convolutional
neural networks to solve classification tasks on microscope
images, such as detecting fluorescence [4]. Successful
applications can assist anatomy experts in predicting
carcinogens in human cells [23]. Due to the high accuracy 
of these models [1], they are helpful in cancer diagnostics. 


Related to education, prior work on virtual microscopes can
be roughly divided into two categories. First, there are
case studies describing how virtual microscopes were inte- 
grated into anatomy curricula and the requirements for suc- 
cessful integration, e.g. [5, 6, 10, 13, 21, 22]. Second, several 
studies have investigated whether students with optical mi- 
croscopes have higher learning gain compared to students 
with a virtual microscope and found that this is not the 
case, e.g. [11, 15]. 


One of our research questions in this paper is to identify 
factors that are related to success in locating structures in 
a virtual microscope. Models that predict student success 
are a common topic of educational data mining research [3]. 
For example, Dietz-Uhler et al. [8] summarized which kinds
of data are often used to predict student success, classified
into data gathered from the Learning Management System
(e.g. clicks on resources) and performance data (e.g. feedback
or grades, created by the instructor or by the system,
respectively). Other papers use demographic data and prior
success to predict success rates, e.g. [16]. Prior work has shown
that, depending on the knowledge domain, different features 
have high importance for predicting student success. For example,
Ramos et al. [20] found that hits in a discussion forum
have high importance for predicting student success. Yukselturk
et al. [24] used a correlational research design and, using
interpretable methods, concluded that self-regulation variables
have a statistically highly significant relation to learning
success. To the best of our knowledge, there is no prior
work on success prediction in virtual microscopes. We want
to close this research gap.


Figure 1: Screenshot of the MyMi.mobile structure search
mode.


To do so, we turn to item response theory. Item response 
theory is concerned with modeling the probability of success 
of a student $i$ at a task $j$ via a logistic distribution over the
difference between a student’s ability parameter $\theta_i$ and a
task’s difficulty parameter $b_j$ [2, 12]. Generalizations of this
model include more parameters and other distributions [2, 
12]. In this paper, we use the standard logistic distribution 
but include auxiliary parameters for features that capture 
student behavior. 
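As a minimal sketch of the basic model described above (before any auxiliary features are added), the success probability can be written as a logistic function of the difference between ability and difficulty. The function name and the example values are illustrative assumptions, not part of the original study.

```python
import math

def rasch_success_probability(ability: float, difficulty: float) -> float:
    """Logistic success probability for ability theta_i and difficulty b_j."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# When ability equals difficulty, success and failure are equally likely.
p_equal = rasch_success_probability(ability=0.0, difficulty=0.0)

# A harder task (larger b_j) lowers the success probability of the same student.
p_hard = rasch_success_probability(ability=0.0, difficulty=2.0)

print(p_equal, round(p_hard, 3))
```

Only the difference between the two parameters matters, which is why a single difficulty parameter per task can shift the success probability of all students at once.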


To analyze the locations of typical mistakes, we perform a 
clustering analysis using Gaussian mixture models [7]. Clus- 
tering is a well-established technique in educational data 
mining [3], e.g. to identify groups of student solutions that 
may warrant similar feedback [9]. Our reasoning is similar: 
We wish to identify typical locations of mistakes in structure 
searches such that we have a reasonably sized set of repre- 
sentative locations that a teacher can inspect and for which 
feedback may be developed. 


3. METHOD 


3.1 MyMi.mobile VM and Dataset 


The MyMi.mobile VM provides three modes: exploration, 
which shows expert annotations, structure search, where stu- 
dents need to locate a structure in a slide, and diagnosis, 
where students need to identify the slide and the stain. The 
structure search mode is shown in Figure 1. Students see 
a tissue slice and are supposed to move the field of view 
(by panning and zooming) until the crosshair is located over 
the correct structure. Then, they confirm their choice by 
clicking the arrow on the bottom right. As additional in- 
terface elements, students see an explanatory text at the 
bottom of the screen (“Position the area to be searched in 
the center of the screen and confirm your decision by press- 
ing the ’continue-button’. Start now!”), a ’minimap’ of the 
slide on the bottom left, and a timer on the top right. Stu- 
dents can select structure searches in any order from a list 
sorted alphabetically according to the slides (e.g. armpit, 
eye, colon)¹. Students can attempt the same search as many
times as they want.


¹ The alphabetical ordering probably introduces an ordering
bias. In particular, we observe that the two most attempted
structure searches are both on the alphabetically first slide.


560 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021)


Figure 2: Illustration of the feature vector $\vec{x}$ (top) for an
attempt of student $i$ on task $j$, and the parameter vector $\vec{w}$
(bottom) of the item response theory model.


We consider a dataset of 19,525 structure search attempts
by 1,460 students recorded in the summer term 2020 at two
German universities. Most students were second-semester
undergraduate students of medicine, with some students
from fourth-semester dentistry (45) and molecular medicine
in the second or fourth semester (39). Most students (817)
did not attempt any structure search. Of the 643 who did,
most attempted few structure searches (median 7), with some
‘heavy users’ making hundreds of attempts (mean 30.37,
maximum 649). Overall, 68.19% of attempts were successful.


For the purpose of validating our model, we also asked four
anatomy teachers using the VM to rate the difficulty of
the 30 most attempted structure search tasks on the platform.
The teachers received the following instruction: any
structure search that at least 65% of students are expected
to solve on their first try should be rated as ‘easy’; any
structure search between 40% and 65% should be rated ‘moderate’;
and any structure search below 40% should be rated as
‘difficult’. These boundaries were chosen based on the actual
success rates of students: 10 of the tasks had an actual success
rate over 65%, 10 had an actual success rate between
40% and 65%, and 10 had an actual success rate below 40%.


3.2 Item Response Theory

In order to investigate RQ1, we trained a generalized item
response theory model implemented via logistic regression. In
particular, we pre-processed each structure search attempt
to be represented as a 1,859-dimensional, highly sparse feature
vector (see Figure 2). The first four dimensions (gray)
contain auxiliary features, namely: 1) How often has the
student failed on the same structure search? (failures) 2)
How often has the student succeeded on the same structure
search? (successes) 3) Has the student already seen
the same slide in the exploratory mode? (explored), and 4)
How many minutes has the student spent on the current
structure search? (duration). The next 1,460 dimensions
(blue) indicate which student made the attempt, i.e. feature
$x_{4+i} = 1$ if the current attempt was made by student
$i \in \{1, \ldots, 1460\}$ and 0 otherwise. The remaining 395
features (orange) indicate which task the attempt was made
on, i.e. feature $x_{1464+j} = 1$ if the current attempt was made
on task $j \in \{1, \ldots, 395\}$ and 0 otherwise.


Our model, then, has the form

\begin{equation}
p(\text{success} \mid \vec{x}) = \frac{1}{1 + \exp(-\vec{w}^T \cdot \vec{x})}
= \frac{1}{1 + \exp(b_j - \theta_i - w_1 \cdot x_1 - \ldots - w_4 \cdot x_4)}, \tag{1}
\end{equation}

where $\vec{x}$ is the sparse feature vector of an attempt and $\vec{w}$ is
the parameter vector (see Figure 2). Note that we obtain a
classic IRT model if the first four features $x_1$, $x_2$, $x_3$, and
$x_4$ are 0. We used the implementation of logistic regression
from the scikit-learn library [19].
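The encoding and the model form described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' scikit-learn pipeline: the helper names, the dictionary-based sparse representation, and all parameter values are hypothetical. The sketch uses the fact that the weight on a student indicator plays the role of the ability $\theta_i$ and the weight on a task indicator plays the role of $-b_j$.

```python
import math

N_AUX, N_STUDENTS, N_TASKS = 4, 1460, 395  # sizes stated in Section 3.2

def encode_attempt(failures, successes, explored, duration, student, task):
    """Sparse feature vector as {index: value}; indices are 1-based as in the text."""
    x = {1: float(failures), 2: float(successes), 3: float(explored), 4: float(duration)}
    x[N_AUX + student] = 1.0              # x_{4+i} = 1 for student i
    x[N_AUX + N_STUDENTS + task] = 1.0    # x_{1464+j} = 1 for task j
    return x

def predict_success(w, x):
    """1 / (1 + exp(-w^T x)) for a sparse feature dict x and a sparse weight dict w."""
    z = sum(w.get(index, 0.0) * value for index, value in x.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: ability theta_7 = 0.5, difficulty b_12 = 1.0
# (stored as weight -1.0), and a weight of 0.4 on the 'explored' feature.
w = {3: 0.4, N_AUX + 7: 0.5, N_AUX + N_STUDENTS + 12: -1.0}
x = encode_attempt(failures=0, successes=0, explored=1, duration=0.0, student=7, task=12)
p = predict_success(w, x)
print(round(p, 3))
```

Setting the four auxiliary features to zero reduces the exponent to $b_j - \theta_i$, i.e. the classic IRT model.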


3.3 Clustering 

To investigate RQ2 and RQ3, we applied clustering on the 
locations of mistakes. More specifically, we used a Gaussian 
mixture model with $K$ components, which approximates the
probability density over locations $(x, y)$ of mistakes in an
image as

\begin{equation}
p(x, y) = \sum_{k=1}^{K} \mathcal{N}\big((x, y) \mid \vec{\mu}_k, \Sigma_k\big) \cdot \pi_k, \tag{2}
\end{equation}

where $\mathcal{N}((x, y) \mid \vec{\mu}_k, \Sigma_k)$ denotes the 2D Gaussian density
with mean $\vec{\mu}_k \in \mathbb{R}^2$ and covariance matrix $\Sigma_k \in \mathbb{R}^{2 \times 2}$, and
where $\pi_k \in [0, 1]$ is the prior for the $k$th Gaussian component.
Compared to other clustering algorithms, Gaussian
mixtures have at least two advantages. First, they can deal 
with non-spherical clusters by adjusting the covariance ma- 
trix accordingly. Second, they provide a probability density 
of the data. Moreover, they remain fast to train with an 
expectation maximization scheme [7]. We use the scikit- 
learn implementation of Gaussian mixtures [19]. To select 
the optimal number of components K, we use the Bayesian 
information criterion [18]. 
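To make the density in Equation (2) and the model selection criterion concrete, the following sketch evaluates a 2D Gaussian mixture and its Bayesian information criterion in plain Python. It does not fit the mixture (the expectation maximization step is omitted), and the function names and toy parameters are assumptions for illustration only. The parameter count assumes full covariance matrices: $K - 1$ free priors, $2K$ mean coordinates, and $3K$ covariance entries.

```python
import math

def gaussian_2d(point, mean, cov):
    """Density of a 2D Gaussian with mean (mx, my) and covariance ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form (p - mu)^T Sigma^{-1} (p - mu) via the closed-form 2x2 inverse.
    q = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def mixture_density(point, weights, means, covs):
    """p(x, y) = sum_k pi_k * N((x, y) | mu_k, Sigma_k)."""
    return sum(pi * gaussian_2d(point, mu, cov)
               for pi, mu, cov in zip(weights, means, covs))

def bic(points, weights, means, covs):
    """BIC = (free parameters) * ln(n) - 2 * log-likelihood; lower is better."""
    k = len(weights)
    n_params = (k - 1) + 2 * k + 3 * k  # priors, means, symmetric 2x2 covariances
    log_lik = sum(math.log(mixture_density(p, weights, means, covs)) for p in points)
    return n_params * math.log(len(points)) - 2.0 * log_lik

# Toy usage with hypothetical parameters: a single standard-normal component.
weights, means, covs = [1.0], [(0.0, 0.0)], [((1.0, 0.0), (0.0, 1.0))]
peak = mixture_density((0.0, 0.0), weights, means, covs)  # equals 1 / (2 * pi)
score = bic([(0.0, 0.0), (1.0, 0.0)], weights, means, covs)
```

In practice one would fit mixtures for several values of $K$ and keep the one with the lowest criterion value, as described above.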


4. RESULTS AND DISCUSSION 


In this section, we present the results of our experiments. 
We begin with the teacher difficulty ratings, then continue 
with the item response theory model (regarding RQ1), and 
conclude with the clustering analysis (regarding RQ2 and RQ3).


4.1 Teacher difficulty ratings 

As the result of our teacher survey, we obtained difficulty
ratings (‘easy’, ‘moderate’, or ‘difficult’) for the 30 most
attempted structure search tasks. We observe that the teachers
agreed moderately. On average, the Kendall τ for pairwise
agreement is 0.4 and the overall Krippendorff’s α is
0.44. To enhance reliability, we consider the average rating
of each task in the subsequent analysis. On average, teachers
rated most tasks as ‘easy’ (about 55%), fewer as ‘moderate’
(just below 35%), and very few as ‘difficult’ (about
10%; refer to blue bars in Figure 3). Recall that the rating
boundaries were chosen such that, according to actual success
rates, all blue bars would have height 1/3.
This indicates that teachers tended to underestimate the
actual difficulty, which may be an instance of the ‘expert blind
spot’, i.e. the phenomenon that experts may fail to imagine
the difficulties of novices [17]. We will use the teacher ratings
as reference to further validate our item response theory
model below.
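The rating instruction from Section 3.1 can be expressed as a small mapping from success rate to difficulty label. The sketch below is illustrative: the function name is hypothetical, and it is applied to the overall success rates of the two tasks discussed in Section 4.3, although the instruction given to the teachers referred to first-try success.

```python
def difficulty_label(success_rate: float) -> str:
    """Map a success rate in [0, 1] to the label defined in the teacher instruction."""
    if success_rate >= 0.65:
        return "easy"
    if success_rate >= 0.40:
        return "moderate"
    return "difficult"

# Overall success rates reported for the two most attempted tasks (Section 4.3).
print(difficulty_label(0.1687))  # myoepithelial cell search
print(difficulty_label(0.5772))  # apocrine gland search
```

Under this assumption, the apocrine gland task falls in the ‘moderate’ band by success rate, although all experts rated it as easy, which is consistent with the tendency to underestimate difficulty noted above.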
Figure 3: Left: The frequency (blue) and the mean actual 
success rate (orange) of tasks rated as easy, moderate, or 
difficult by teachers. Right: The average difficulty parameter 
assigned by the model to tasks rated as easy, moderate, or 
difficult by teachers. 
Figure 4: Calibration plot of the IRT model. Dashed lines 
indicate bin width. The color indicates how full each bin is. 


4.2 Factors of student success 

In order to investigate which factors contribute to student 
success (RQ1), we trained an item response theory model 
(refer to Section 3.2) on our data. 


Model validation To validate the model, we performed
three analyses. First, we performed a 10-fold cross-validation
over attempts, yielding 80.19% ± 0.13% training accuracy
and 77.73% ± 1.83% test accuracy (mean ± standard
deviation). Because our data is imbalanced (with fewer
failures than successes), we also considered AUC (0.86 ± 0.001
in training and 0.82 ± 0.02 in test) and F1 score (0.69 ± 0.003
in training and 0.66 ± 0.024 in test, with a test precision of
0.73 ± 0.06 and a test recall of 0.60 ± 0.03). All measures
indicate good generalization from training to test set. For
the remainder of this section, we consider a model trained
on all data.
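As a quick plausibility check of the reported test metrics, the F1 score is the harmonic mean of precision and recall. The sketch below plugs in the reported mean precision and recall. Note that the average of per-fold F1 scores need not equal the F1 of the averaged precision and recall, so this is only an approximate consistency check, and the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Reported mean test precision (0.73) and mean test recall (0.60).
f1_from_means = f1_score(0.73, 0.60)
print(round(f1_from_means, 2))  # prints 0.66
```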


Second, we assessed model calibration. Calibration means 
that the predicted success probability of a student corre- 
sponds to the actual success rate [14]. To analyze this, we 
aggregated data into bins according to the predicted suc- 
cess probability (each bin had a width of 10%) and then 
computed the actual success rate within each bin. Figure 4 
shows the corresponding calibration curve, where the dashed 
lines indicate the width of each bin in the analysis. Given 
that the curve remains within the dashed zone, we conclude
that our model was well-calibrated. Most predictions
(27.5%) were in the 90% to 100% bin (orange dot), i.e. our
model predicted successful attempts with high confidence.


Figure 5: The scaled weights of auxiliary features.
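The binning procedure used for the calibration analysis can be sketched as follows. The function names and the toy data are assumptions, and the check `within_bin` encodes one reading of "the curve remains within the dashed zone", namely that the actual success rate of each bin lies between the bounds of that bin.

```python
def calibration_curve(predicted, outcomes, n_bins=10):
    """Return (lower, upper, mean_prediction, success_rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[index].append((p, y))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(y for _, y in members) / len(members)
        rows.append((index / n_bins, (index + 1) / n_bins, mean_p, rate, len(members)))
    return rows

def within_bin(row):
    """True if the actual success rate lies inside the bounds of its bin."""
    lower, upper, _, rate, _ = row
    return lower <= rate <= upper

# Toy data: ten attempts predicted at 0.95, nine of which succeeded.
rows = calibration_curve([0.95] * 10, [1] * 9 + [0])
print(rows[0][3], within_bin(rows[0]))
```

The count in each row corresponds to the "fullness" of a bin that Figure 4 shows by color.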


Third, we compared the difficulty parameters of our model 
with the human ratings from Section 4.1. Figure 3 (right) 
displays the average difficulty parameter assigned by the IRT 
model for each difficulty class. We observe that tasks rated 
as more difficult by teachers were also rated as more difficult 
by the model. Tasks rated as ‘easy’ by the teachers have a
mean difficulty parameter of 0.5, tasks rated as ‘moderate’
have a mean difficulty parameter of 2, and tasks rated as
‘difficult’ a mean parameter of 3.


Overall, we note that the model is reasonably accurate, well- 
calibrated, and agrees with teacher ratings of difficulty. 


Factors of success Next, we inspect the weights of our
model to infer which features are predictive of student success.
To make the weights comparable, we normalized the
auxiliary features to the same scaling as the binary features.
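The text does not specify how the auxiliary features were normalized. One common way to bring count or duration features onto the same [0, 1] range as binary indicator features is min-max scaling, sketched below purely as an assumption about what such a normalization could look like. The function name and the example durations are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: carries no information
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attempt durations in minutes.
durations = [0.5, 2.0, 3.5]
scaled = min_max_scale(durations)
print(scaled)  # prints [0.0, 0.5, 1.0]
```

With all features on a comparable range, the magnitude of a weight can be read as the change in the logit between the smallest and largest observed feature value.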


Regarding auxiliary features (Figure 5), we observe that the 
number of prior failures had a low negative weight, i.e. it is 
not predictive of student success. This is likely explained by 
the design of the MyMi.mobile VM. On a failure, students 
only learned that they were wrong but not where the right 
answer might be. This ensures that students cannot get 
the right answer by trial and error. Attempt duration also 
had a low negative weight. This may be because duration 
is an ambiguous feature. Students may take longer both for 
productive reasons — e.g. inspecting the slide in more detail 
to validate the image against the definition of the structure 
— and unproductive reasons — e.g. being distracted. Ac- 
cordingly, duration may not provide predictive information 
either way. 


By contrast, we obtained positive scaled weights for the suc- 
cesses (0.39) and explored (0.47) features. The explanation 
for the former is obvious: If you have found the correct 
solution for the task once, chances are you memorized the 
location and can find it again. An explanation for the latter 
is that having seen an annotated example of the structure 
helps to find another instance of it in a structure search. 
That being said, we cannot make causal inferences in this 
model. It is also possible that students who are more likely 
to succeed for other reasons are also more likely to consult 
the exploratory slides. On the other hand, we account for 
a general underlying student ability via the student ability 
parameter (Figure 6). 


Figure 6: The success rate vs. the student ability of structure
searches. Each dot represents a student. Color indicates the
number of attempted structure searches by a student.


Figure 7: The success rate vs. the task difficulty of structure
searches. Each dot represents a task. The heatmap colors
indicate the number of attempts of a given task.


We observe that student ability parameters vary in the range
from −1.97 to 1.55 and most parameters are concentrated around
0 ± 0.54 (Figure 6). We also observe that the correlation
between the ability parameter and actual success rate is
relatively weak (Kendall τ = 0.52). To further investigate
the role of student ability, we performed another 10-fold 
cross validation over students instead of attempts, i.e. we 
tried to generalize to students that the model had never 
seen before and who thus had an ability parameter of 0. 
In this setting, we still obtained an average training accuracy
of 80.19% ± 0.13% and an average test accuracy of
77.93% ± 1.83%, indicating that the student ability parameters
contributed little to an accurate prediction. We have
two possible explanations for this finding: First, it may just 
be that the underlying ’true’ student ability is relatively uni- 
form because almost all students were in the same semester 
at the same two universities. Second, student ability may 
change during usage of the microscope, such that a single 
parameter may not be able to capture student ability par- 
ticularly well. 
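Cross-validation over students instead of attempts means that all attempts of a given student end up on the same side of each split. A minimal sketch of such a grouped split is shown below. The function name, the fold assignment scheme, and the toy data are assumptions, and scikit-learn's `GroupKFold` offers comparable functionality (whether the authors used it is not stated).

```python
import random

def student_folds(attempt_students, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) so that no student occurs in both sets."""
    students = sorted(set(attempt_students))
    random.Random(seed).shuffle(students)
    # Assign each student (not each attempt) to exactly one fold.
    fold_of = {s: position % n_folds for position, s in enumerate(students)}
    for fold in range(n_folds):
        test = [i for i, s in enumerate(attempt_students) if fold_of[s] == fold]
        train = [i for i, s in enumerate(attempt_students) if fold_of[s] != fold]
        yield train, test

# Toy data: 25 hypothetical students with three attempts each.
attempt_students = [s for s in range(25) for _ in range(3)]
splits = list(student_folds(attempt_students))
```

Because test students never appear in training, their ability parameters are never fitted and effectively stay at 0, which matches the setting described above.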


Finally, we find that the task difficulty had the clearest
relation to success compared to the other features. As shown
in Figure 7, parameters range from −2.22 to 3.19 (mean
0 ± 1.19) and anti-correlate very well with the actual success
rate (Kendall τ = −0.91). This indicates that tasks had a
roughly consistent difficulty across students. It also explains
how our IRT model generalized well to new students.
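The rank correlation used here can be illustrated with a small pure-Python implementation of Kendall's τ in its tie-corrected τ-b variant. Which variant the authors computed is not stated, so this choice, the function name, and the toy numbers below are assumptions.

```python
import math

def kendall_tau_b(xs, ys):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for a in range(n):
        for b in range(a + 1, n):
            dx, dy = xs[a] - xs[b], ys[a] - ys[b]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: ignored by tau-b
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + ties_x) * (untied + ties_y))

# Hypothetical difficulty parameters and success rates of five tasks.
difficulties = [-2.0, -0.5, 0.3, 1.2, 3.0]
success_rates = [0.95, 0.80, 0.70, 0.45, 0.17]
tau = kendall_tau_b(difficulties, success_rates)
print(tau)  # prints -1.0
```

A value close to −1 means that ranking tasks by difficulty parameter almost exactly reverses their ranking by success rate.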


In summary, we observe that prior success on the same task, 
having seen the corresponding exploratory slide, and task 
difficulty were most predictive of student success, whereas 
student ability was only moderately predictive and prior fail- 
ures as well as duration were not predictive. 


4.3 Typical mistakes 

To investigate RQ2 and RQ3, we consider the two most at- 
tempted structure search tasks, namely searching for the 
nucleus of a myoepithelial cell and searching for an apoc- 
rine gland in human armpit tissue (refer to Figure 8 left and 
right, respectively). 


The myoepithelial cell search (Figure 8, left) was the hardest 
task in the whole dataset with only 16.87% correct guesses 
(shown as green dots), with a difficulty parameter of 3.19, 
and unanimous consent of all four experts that it is diffi- 
cult. Figure 8 (left) illustrates why the task is difficult: The 
correct regions (in green) are small and hard to spot. 


By contrast, the slide for the apocrine gland task (Figure 8, 
right) exhibits many and large correct regions. Accordingly, 
57.72% of guesses were correct (green dots), the model as- 
signed a lower difficulty rating (1.28), and all experts agreed 
that this task is easy. 


To identify typical mistakes, we trained a 10-component² 
Gaussian mixture model to cluster all the mistake locations 
(shown as blue dots). The cluster means are plotted as orange
shapes in Figure 8. Interestingly, most clusters for the
myoepithelial cell search task, namely the orange squares in
Figure 8 (left), could plausibly be nuclei of myoepithelial
cells. The bottom-most orange diamond is also located
near a correct region. Only the remaining orange diamonds
are clearly wrong because they are not located at cell nuclei.
Generally, many students seemed to have a correct under- 
standing of the structure to be found but failed to spot un- 
ambiguously correct locations. 


By contrast, the cluster means for the apocrine gland search 
(Figure 8, right) indicate deeper misconceptions. All cluster 
centers are clearly wrong. More specifically, the diamond 
in the bottom right corresponds to an eccrine instead of
an apocrine gland, and the center diamond corresponds to a
broken structure. 


In both tasks, we can use cluster centers as a tool to find 
typical misconceptions that need to be discussed in class. 


5. CONCLUSION 


In this paper, we investigated three research questions re- 
garding structure search tasks in virtual microscopes, namely 
1) Which features predict student success? 2) What are typ- 
ical locations of student mistakes? 3) What are underlying 
misconceptions explaining these locations? 


² We observed that only little improvement in the Bayesian
information criterion could be achieved for more than 10
components. We also observed that 10 components were sufficient
such that some components ended up unused in Figure 8.
For other slides, different numbers may be needed.
Figure 8: Students’ correct (green) and wrong (blue) guesses on structure searches for myoepithelial cell nuclei (left) and apocrine 
glands (right). Correct structures are outlined in green. The centers of mistake clusters are orange shapes. 


To answer the first question, we trained a generalized item 
response theory (IRT) model, obtaining 77% accuracy and 
0.82 AUC in 10-fold cross-validation as well as solid cal- 
ibration. Of the features considered, we found that task 
difficulty was particularly predictive of student success and 
the obtained difficulty parameters aligned well with actual 
student success rates and expert ratings. We observed less 
predictive value of student ability, illustrated by the fact 
that IRT models could generalize without loss of accuracy 
to new students. Moreover, prior success on the same task 
and having seen an annotated version of the same histolog- 
ical slide were predictive of success, whereas prior failures 
and duration spent on the task were not. This is interesting 
because it suggests that time stamps could be removed from 
the data, enhancing the privacy of the system. 


Regarding the second and third research questions, we applied
clustering on mistake locations and interpreted the
cluster centers in terms of misconceptions that may have led 
students to wrongly click at these locations. Such miscon- 
ceptions can then be discussed in class to improve students’ 
learning, or can be used to provide adaptive feedback in the 
virtual microscope tool. 
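How cluster centers might drive adaptive feedback can be sketched as a simple lookup: a wrong click is matched to the nearest cluster center, and the feedback text attached to that center is returned. This is a hypothetical illustration, not a feature of MyMi.mobile. The coordinates and feedback messages are invented, and a full Gaussian mixture would assign clicks by posterior probability rather than by Euclidean distance.

```python
import math

def nearest_cluster(click, cluster_means):
    """Index of the cluster mean closest to the click location (Euclidean distance)."""
    return min(range(len(cluster_means)),
               key=lambda k: math.dist(click, cluster_means[k]))

# Hypothetical cluster means (normalized slide coordinates) with teacher-written feedback.
cluster_means = [(0.20, 0.20), (0.85, 0.85)]
feedback = [
    "This region shows a damaged part of the tissue.",
    "This is an eccrine gland, not an apocrine gland.",
]

click = (0.80, 0.90)
message = feedback[nearest_cluster(click, cluster_means)]
print(message)
```

A teacher would only need to write one feedback text per cluster center, which keeps the authoring effort proportional to the number of typical mistakes rather than the number of attempts.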


Overall, this work represents the first step towards educa- 
tional data mining on virtual microscope data with results 
that can be used to improve virtual microscope education, 
e.g. by ordering structure searches according to difficulty, by 
discussing typical misconceptions in class, and by enhancing 
annotations. Further work remains to be done, though. In 
particular, more features should be included to both enhance 
accuracy and find educational interventions that support 
student performance (like the exploratory view). Further, 
one could include relations between tasks in the model, thus 
identifying tasks that share an underlying skill, and extend 
the analysis to more advanced knowledge tracing methods. 
Finally, convolutional neural networks could be utilized to 
generalize teacher annotations and to identify regions of im- 
ages that are easy to confuse with a structure to be searched. 